1.  Introduction


Efforts to leverage the growing computational power available through parallel programming must inevitably address heterogeneous computing. Multicore central processing units (CPUs) and various hardware accelerators create a complex architectural landscape that constrains parallel application design and implementation.4 As the underlying hardware changes, code must often be rewritten to remain viable. Achieving high performance on heterogeneous systems with complex memory hierarchies is a major challenge. Furthermore, because data movement dictates computational cost and power, efficiency on future systems will require programming models that can reason about program data.

The Legion programming system from Stanford University was developed to address the specific issues of parallel heterogeneous computing from a data-centric viewpoint.5 Legion defines a methodology in which the developer expresses parallelism through a group of specially designated functions called tasks, each of which operates on some portion of the global domain in parallel.6 This form of parallelism is often referred to as task parallelism and is implicit in the design.5,6

However, taking advantage of the benefits Legion offers for parallel heterogeneous computing requires an understanding of how the programming system works. The intent of this study is to examine the syntax and application-building potential of the Legion programming system via a representative computing algorithm. The n-body problem, an important part of many physics applications, was chosen as the vehicle for this examination. An outline of the rest of this work follows.

Section  2  provides  the  reader  with  a  background  into  both  the  n-body  simulation  utilized  by  this 
work  and  the  Legion  programming  system.  This  section  also  describes  the  technical 
specifications  of  the  computing  environment  used  by  this  study.  The  details  of  building  the 
n-body  simulation  using  Legion  are  discussed  in  Section  3.  Section  4  discusses  the  observed 
results,  and  the  final  section  provides  a  conclusion  and  potential  future  directions. 


2.  Background 


This section briefly provides background on the n-body problem as well as some pertinent information regarding the Legion programming system. Legion is a fairly flexible system that allows developers to create initial programs with little effort, yet those programs can evolve into far more complex structures and algorithms. Therefore, the information regarding Legion in this work is far from exhaustive, and the interested reader is referred to Bauer et al.5,6 and Treichler10 for more in-depth examinations of particulars beyond the implementation of the n-body algorithm used in this work.

2.1  The  N-body  Problem 

The n-body problem describes how a set of n bodies/particles interact with one another over time and is based on principles proposed by Sir Isaac Newton.11 The n-body problem is a critical component of many applications, and its computational models generally fall under 2 paradigms: pair-wise or hierarchical tree-based solutions.12-14 This work follows the pair-wise computational model for gravitational simulation, which has an asymptotic computational complexity of $O(N^2)$, where N is the number of bodies/particles in the system,15 because performance is not the intent. As illustrated in Fig. 1, the pair-wise algorithm generates this computational cost via the body-to-body interactions calculated in lines 6-17.


     1.  SET gdt TO GRAVITY * TimeStep
     2.  SET EPS TO 0.0001
     3.  FOR T <= 0 TO TotalTime
     4.      FOR I <= 0 TO N
     5.          Particle p0 <= Particles[I]
     6.          FOR J <= I + 1 TO N
     7.              Particle p1 <= Particles[J]
     8.              dx <= p1.x - p0.x
     9.              dy <= p1.y - p0.y
    10.              dz <= p1.z - p0.z
    11.              invr <= 1 / SQRT(dx*dx + dy*dy + dz*dz + EPS)
    12.              invr3 <= invr*invr*invr
    13.              force <= p1.mass*invr3
    14.              ax <= ax + force*dx
    15.              ay <= ay + force*dy
    16.              az <= az + force*dz
    17.          END FOR
    18.          Xnew[I] <= p0.x + gdt*p0.velX + 0.5*gdt*gdt*ax
    19.          Ynew[I] <= p0.y + gdt*p0.velY + 0.5*gdt*gdt*ay
    20.          Znew[I] <= p0.z + gdt*p0.velZ + 0.5*gdt*gdt*az
    21.          p0.velX <= p0.velX + gdt*ax
    22.          p0.velY <= p0.velY + gdt*ay
    23.          p0.velZ <= p0.velZ + gdt*az
    24.      END FOR
    25.      FOR I <= 0 TO N
    26.          Particle p <= Particles[I]
    27.          p.x <= Xnew[I]
    28.          p.y <= Ynew[I]
    29.          p.z <= Znew[I]
    30.      END FOR
    31.  END FOR

Fig.  1  Pair-wise  n-body  algorithm 
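
To make the cost of the body-to-body loop concrete, the following is a minimal C++ sketch of one time step of the pair-wise update. It is not the code used in this study. The `Particle` layout and the function name `pairwiseStep` are hypothetical, and the sketch makes 2 assumptions that Fig. 1 does not state: the acceleration accumulators are reset to zero for each body, and the inner loop visits every other body (j != i) rather than starting at I + 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical particle layout; field names mirror the pseudocode in Fig. 1.
struct Particle {
    double x, y, z;           // position
    double velX, velY, velZ;  // velocity
    double mass;
};

// Advance all particles by one step using all-pairs interactions.
// gdt and eps play the roles of `gdt` and `EPS` in Fig. 1.
void pairwiseStep(std::vector<Particle>& particles, double gdt, double eps) {
    assert(eps > 0.0);  // softening keeps 1/sqrt finite for coincident bodies
    const std::size_t n = particles.size();
    std::vector<double> xNew(n), yNew(n), zNew(n);

    for (std::size_t i = 0; i < n; ++i) {
        Particle& p0 = particles[i];
        double ax = 0.0, ay = 0.0, az = 0.0;  // assumption: reset per body
        for (std::size_t j = 0; j < n; ++j) {
            if (j == i) continue;  // assumption: visit every other body
            const Particle& p1 = particles[j];
            const double dx = p1.x - p0.x;
            const double dy = p1.y - p0.y;
            const double dz = p1.z - p0.z;
            const double invr = 1.0 / std::sqrt(dx * dx + dy * dy + dz * dz + eps);
            const double force = p1.mass * invr * invr * invr;
            ax += force * dx;
            ay += force * dy;
            az += force * dz;
        }
        // New positions are buffered so later bodies still see old positions.
        xNew[i] = p0.x + gdt * p0.velX + 0.5 * gdt * gdt * ax;
        yNew[i] = p0.y + gdt * p0.velY + 0.5 * gdt * gdt * ay;
        zNew[i] = p0.z + gdt * p0.velZ + 0.5 * gdt * gdt * az;
        p0.velX += gdt * ax;
        p0.velY += gdt * ay;
        p0.velZ += gdt * az;
    }
    for (std::size_t i = 0; i < n; ++i) {
        particles[i].x = xNew[i];
        particles[i].y = yNew[i];
        particles[i].z = zNew[i];
    }
}
```

The two nested loops over n bodies are the source of the quadratic cost, and buffering the new positions corresponds to the separate write-back loop in lines 25-30 of Fig. 1.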
2.2  The  Legion  Programming  Model

The Legion programming system addresses parallelism in a data-centric fashion, providing a set of tasks that execute on defined logical regions in the code.5,6,10 These tasks are structured in a hierarchy, and their associated logical regions are bound to actual physical addresses at runtime to allow for optimal system flexibility. A top-level task is defined for every Legion program, but beyond this the developer is free to define any number of custom operations as long as they are registered with the system prior to execution.5,10

2.3  Registering  Legion  Tasks 

The top-level task is executed by the Legion programming system first and can be viewed as the control point for the system. Figure 2 shows the general hierarchy of Legion task registration, whereby the top-level task is registered first, followed by any other tasks for the application, before the invocation of the Legion start function.5,6
Fig.  2  Legion  task  registration  hierarchy

The  registration  of  Legion  tasks  with  the  runtime  engine  is  accomplished  by  passing  the  defined 
task  to  the  template  of  the  register-Legion-task  object  along  with  a  unique  integer  identifier.5,6 
This  allows  the  Legion  runtime  to  associate  the  defined  task  with  the  globally  unique 
identification  value.6  Once  the  defined  tasks  are  properly  registered  with  the  Legion  runtime,  the 
system  must  start  execution. 

The Legion start function will automatically call all underlying communication libraries, such as the Message Passing Interface (MPI), to properly initialize the system.5,6,10 The start function will look for all defined and registered Legion tasks and execute them in the predefined hierarchy (see Fig. 2). Upon completion, Legion will automatically call the clean-up and shut-down operations of the underlying communication libraries and return the resulting status flag (e.g., success or failure).6
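
The register-then-start pattern described in this section can be illustrated with a small, self-contained C++ sketch. This is a toy stand-in and not the Legion API: the class `ToyRuntime` and its member functions are hypothetical names chosen for illustration, and no communication library is initialized. It only shows tasks being associated with unique integer identifiers, a top-level task being designated, and a start call that returns a status flag.

```cpp
#include <exception>
#include <functional>
#include <map>
#include <stdexcept>
#include <utility>

// Toy stand-in for a task runtime; NOT the Legion API.
class ToyRuntime {
public:
    using Task = std::function<void(ToyRuntime&)>;

    // Associate a task body with a unique integer identifier.
    void registerTask(int id, Task task) {
        if (!tasks_.emplace(id, std::move(task)).second) {
            throw std::invalid_argument("task ID already registered");
        }
    }

    // Designate which registered task acts as the control point.
    void setTopLevel(int id) { topLevelId_ = id; }

    // Run a previously registered task; throws if the ID is unknown.
    void run(int id) { tasks_.at(id)(*this); }

    // Begin execution at the top-level task.
    // Returns 0 on success and 1 on failure.
    int start() {
        try {
            run(topLevelId_);
        } catch (const std::exception&) {
            return 1;
        }
        return 0;
    }

private:
    std::map<int, Task> tasks_;
    int topLevelId_ = 0;
};
```

In this sketch a top-level task would call `run` on the identifiers of its subtasks, which mirrors the idea that every task must be registered before the start call can reach it.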
2.4  Launching  Legion  Tasks

The defined and registered Legion tasks, with the exception of the top-level task, must be explicitly invoked via the TaskLauncher object. TaskLauncher objects can be passed either single or multiple arguments with the TaskArgument or ArgumentMap objects, respectively.6 However, to ensure optimal flexibility and performance, Legion follows the delayed execution model and does not bind arguments until the TaskLauncher object is passed to the actual execution point in the program.5,10
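
The delayed execution idea can be sketched in plain C++ as follows. The types `ToyLauncher` and `ToyExecutor` are hypothetical and do not reproduce Legion's TaskLauncher, TaskArgument, or ArgumentMap. The sketch assumes a simple model in which the launcher only describes the work, the argument is copied when the launcher is handed to the executor, and nothing runs until the executor drains its queue.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Toy launcher, NOT Legion's TaskLauncher: it records what to run and the
// argument to pass, but runs nothing by itself.
template <typename Arg>
struct ToyLauncher {
    std::function<void(const Arg&)> body;
    Arg argument;
};

// Toy deferred executor: launch() only enqueues; drain() executes in order.
class ToyExecutor {
public:
    // The launcher is taken by value, so the argument is bound here.
    template <typename Arg>
    void launch(ToyLauncher<Arg> launcher) {
        pending_.push([l = std::move(launcher)] { l.body(l.argument); });
    }

    // Execute all pending work in the order it was launched.
    void drain() {
        while (!pending_.empty()) {
            auto work = std::move(pending_.front());
            pending_.pop();
            work();
        }
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    std::queue<std::function<void()>> pending_;
};
```

Changes made to a launcher before `launch` are visible to the task, whereas changes made afterward are not, because the executor holds its own copy.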

2.5  Domain  of  Execution 

The domain of execution for a Legion program is defined by the Domain object, which is available in structured and unstructured versions.5,6,10 The unstructured domain is a grouping of points in the logical region that the developer must explicitly call and initialize, whereas the structured domain is represented by a rectangular object that is implicitly initialized.6 These domain definitions are validated prior to execution and represent logical regions that each task can access and manipulate according to the task's privileges and coherence settings.5,6,10

The logical regions in Legion are broken down into sets of index spaces and field spaces.5 The index space can be viewed as the rows in a row-column description and the field space as the columns, such that the field space describes the entries stored for each row of the current region.6 These regions can be shared by other regions or can maintain exclusivity as shown in Fig. 3.


Fig.  3  Legion  region  descriptions 




2.6  Legion  Is  Object-Oriented 

Legion supports covariant and contravariant bindings, meaning that the ordering of subtypes (e.g., polymorphism) is preserved or reversed, respectively. This ability ensures that Legion executes in an object-oriented manner that should be familiar to any C++ programmer. Arguments are passed into and out of Legion tasks by value.

Legion permits a given task to have declared side effects on declared regions of data, so that logical regions can act as heaps whose state is changed by the tasks themselves. This design allows tasks to be written in an imperative or functional style, providing maximum flexibility within the context of a C/C++ program. However, it also means that Legion views function pointers as analogous to global variables, and therefore passing function pointers to tasks is strongly discouraged.

2.7  Computing  Environment  Employed 

The computing environment used for this work is a 64-node heterogeneous cluster consisting of 48 IBM dx360M4 nodes, each with one Intel Phi 5110P, and 16 dx360M4 nodes, each with one NVIDIA Kepler K20M graphics processing unit. Each node contains dual Intel Xeon E5-2670 (Sandy Bridge) CPUs, 64 GB of memory, and a Mellanox FDR-10 InfiniBand host channel adaptor.16,17

However, this work does not employ accelerators, concentrating instead on the behavior of a single compute node, i.e., an Intel Xeon E5-2670 limited to a single processing core. Applying the Legion programming model and runtime library to a single CPU job slot isolates the shared memory behavior, which could then be extended to the distributed parallel environment at a later date.

Workload management for the computing environment was implemented using IBM Platform LSF 9.1, which is specifically designed for mission-critical high-performance computing environments.18 The operation was limited to a single, nonaccelerator job slot by setting the batch submit switch n to 1 and not explicitly calling any Intel Many Integrated Core Architecture devices. The BLAUNCH command was employed to allow parallel operation within the shared memory paradigm for this work.18

The  next  section  discusses  the  implementation  of  the  gravitational  n-body  simulation  using  the 
Legion  programming  system. 
3.  Building  the  Application


The first step in building the n-body simulation employed by this study is to properly install the Legion programming system itself as described by Harada. Legion can be deployed as either a shared or distributed memory system; the latter requires using the GASNet library for defined communication calls, and the reader is referred to Harada2 for further details. After downloading Legion and setting the Legion runtime environment variable(s), as per Harada, the next step is to decide how to partition the domain as a set of Legion tasks.

3.1  Defining  Tasks 

Legion operates the set of tasks as a hierarchy with the top-level task as the designated control point;5 the n-body simulation implemented for this study is no exception. The top-level task for this work employs an unstructured set of bodies/particles, which must therefore be explicitly initialized before validation. This top-level task instantiates a logical region that is then loaded with data from an external file as a structure of particles using a defined inline TaskLauncher object (see Fig. 4). Once complete, the physical region is unmapped so that control may then pass to the next task in the hierarchy, i.e., the main task.


    SET Particles p TO NULL           // array of particle structs
    CALL load_file_data(file)         // function loads external file
    CREATE Logical Region lr          // bind logical region
    SET ptr_t next TO 0               // base of particles
    CREATE TaskLauncher (WRITE)       // task launcher
    FOR I <= 0 TO N
        SET ptr TO next->value++      // next object in particle set
        WRITE ptr (p[I])
    END FOR
    UNMAP REGION
    CALL main-task

Fig.  4  Top-level  task 

The main task defined for this work controls the outer loop of the n-body algorithm shown as line 3 in Fig. 1. Essentially the only operation this task performs is to call the last task in the application hierarchy, i.e., the force-update task (see Fig. 5). The main task must unmap the current physical region before calling the force-update task and remap it afterward, as this allows Legion to properly schedule execution independent of the enclosing task.
    FOR T <= 0 TO TOTAL_TIME
        UNMAP REGION
        CREATE TaskLauncher (force-update, READ-WRITE)
        CALL force-update             // sub-task
        REMAP REGION
    END FOR

Fig.  5  Main  task 

The  force-update  task  for  this  work  is  derived  directly  from  lines  6-17  and  18-23  of  Fig.  1  and, 
as  such,  is  exactly  the  same  algorithm.  Therefore,  for  the  sake  of  brevity,  it  is  not  repeated  at  this 
point. 


3.2  Final  Algorithm 


Once  the  task  partitioning  strategy  has  been  designed,  the  final  program  is  ready  to  deploy  and 
can  be  seen  in  Fig.  6. 


    CALL top-level-task
        LOAD input data
        WRITE particle data to PHYSICAL REGION
        CALL main-task
            FOR EACH Time Step
                CALL force-update-task
            END FOR
    DONE

Fig.  6  Final  application  design 


4.  Observed  Results 


The objective of this work is a better understanding of the practical uses of the Legion programming system with regard to an example computing application rather than performance. As such, it is critical that the validity of the program itself be established. This section compares the observed results of the gravitational n-body application built with Legion against those from a standard C++ implementation. Both programs are executed with the same input files in the same computing environment.

The Table compares the coordinate positions produced by the C++ version of the gravitational n-body program used in this study with those of the corresponding Legion application. These coordinates are for an 8-body system and agree in every entry, which supports the correctness of the Legion program.

Table  N-body validity for 8 bodies

    Particle   X       Y       Z       X-legion   Y-legion   Z-legion
    0          1.66    9.37    6.211   1.66       9.37       6.211
    1          9.19    3.94    7.59    9.19       3.94       7.59
    2          9.36    0.46    0.37    9.36       0.46       0.37
    3          6.70    8.34    5.66    6.70       8.34       5.66
    4          5.26    9.94    9.12    5.26       9.94       9.12
    5          10.00   0.00    10.00   10.00      0.00       10.00
    6          8.60    8.39    8.77    8.60       8.39       8.77
    7          0.00    10.00   0.00    0.00       10.00      0.00
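
A comparison of this kind can also be automated. The following C++ sketch is one possible check and is not part of the study: the `Position` struct, the function `positionsMatch`, and the use of an absolute tolerance are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical record of one body's coordinates from either program.
struct Position {
    double x, y, z;
};

// Returns true when both runs report the same number of bodies and every
// coordinate agrees to within the absolute tolerance `tol`.
bool positionsMatch(const std::vector<Position>& a,
                    const std::vector<Position>& b,
                    double tol) {
    assert(tol >= 0.0);
    if (a.size() != b.size()) {
        return false;
    }
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (std::fabs(a[i].x - b[i].x) > tol ||
            std::fabs(a[i].y - b[i].y) > tol ||
            std::fabs(a[i].z - b[i].z) > tol) {
            return false;
        }
    }
    return true;
}
```

A tolerance is used rather than exact equality because two implementations may order their floating-point operations differently, so small rounding differences would not by themselves indicate an error.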

5.  Conclusions  and  Future  Work 


This work has presented the Legion programming paradigm within the context of the n-body simulation, an important computational algorithm employed in much of today's high-performance computing research. The gravitational n-body algorithm was defined using the Legion programming paradigm to illustrate the potential for future research in computing and has been shown to be both practical and correct for this purpose. The Legion programming system is flexible and introduces the developer to task-oriented programming, free from underlying computing architectures and communication procedures.

Future directions for this programming model are numerous and include a more effective and portable design for parallel heterogeneous computing. In particular, extending this work to include the Fast Multipole Method (FMM) for the n-body problem is an interesting direction. An FMM implementation is currently being planned in concert with Stanford University as a part of a larger strategy of task-oriented solutions to heterogeneous parallel computing.